Correcting for selection bias in HIV prevalence estimates: an application of sample selection models using data from population‐based HIV surveys in seven sub‐Saharan African countries

Abstract Introduction Population‐based biomarker surveys are the gold standard for estimating HIV prevalence but are susceptible to substantial non‐participation (up to 30%). Analytical missing data methods, including inverse‐probability weighting (IPW) and multiple imputation (MI), are biased when data are missing‐not‐at‐random, for example when people living with HIV more frequently decline participation. Heckman‐type selection models can, under certain assumptions, recover unbiased prevalence estimates in such scenarios. Methods We pooled data from 142,706 participants aged 15–49 years from nationally representative cross‐sectional Population‐based HIV Impact Assessments in seven countries in sub‐Saharan Africa, conducted between 2015 and 2018 in Tanzania, Uganda, Malawi, Zambia, Zimbabwe, Lesotho and Eswatini. We compared sex‐stratified HIV prevalence estimates from unadjusted, IPW, MI and selection models, controlling for household and individual‐level predictors of non‐participation, and assessed the sensitivity of selection models to the copula function specifying the correlation between study participation and HIV status. Results In total, 84.1% of participants provided a blood sample to determine HIV serostatus (range: 76% in Malawi to 95% in Uganda). HIV prevalence estimates from selection models diverged from IPW and MI models by up to 5% in Lesotho, without substantial precision loss. In Tanzania, the IPW model yielded lower HIV prevalence estimates among males than the best‐fitting copula selection model (3.8% vs. 7.9%). Conclusions We demonstrate how HIV prevalence estimates from selection models can differ from those obtained under missing‐at‐random assumptions. Further benefits include exploration of plausible relationships between participation and outcome. While selection models require additional assumptions and careful specification, they are an important tool for triangulating prevalence estimates in surveys with substantial missing data due to non‐participation.


I N T R O D U C T I O N
Accurate HIV prevalence estimates are crucial for monitoring HIV disease burden and guiding testing and clinical management. However, population-based HIV biomarker surveys, the gold standard for estimating HIV prevalence, can be prone to substantial non-participation that is non-random [1]. Past biomarker non-response rates have been as high as 30% in nationally representative Demographic and Health Surveys (DHS) and other population-based studies [2][3][4]. When participants and non-participants differ with respect to HIV status, naïve use of prevalence data from surveys yields biased results [1]. HIV-seropositive persons may decline to test more often for various reasons. Individuals who know their serostatus may want to avoid unnecessary re-testing [5]; those who do not know, but suspect their status, may decline to test for fear of disclosure and HIV-related stigma or to prevent imposing burdens on family or caregivers [6]. Growing attention has turned to analytical methods to deal with non-ignorable missing data, such as those that are missing not at random (MNAR). If data are missing completely at random, no adjustment is needed. When HIV status is missing at random (MAR) conditional on observed variables, the use of inverse-probability weighting (IPW) or multiple imputation (MI) is sufficient to generate unbiased estimates [7,8]. However, it is not possible to empirically verify whether data are indeed MAR or instead MNAR, where unwillingness to consent to biomarker testing is associated with HIV status (actual or perceived) due to unmeasured factors. In such scenarios, HIV prevalence estimates are susceptible to bias even when using IPW or MI methods.
One method that can potentially recover unbiased prevalence estimates from MNAR data is the Heckman-type sample selection model [9,10]. Selection models operate by using the association between the probability of selection (HIV test participation) and the outcome of interest (HIV serostatus) and jointly estimating population-level prevalence for participants and non-participants. Selection models empirically require one or more selection variables-that is, variables that predict the outcome only via their association with the participation process. Previous work has proposed that interviewer identity fits this criterion for biomarker-measured outcomes, since interviewer skill and personality may affect a participant's willingness to consent for testing but could not plausibly affect their HIV status; moreover, interviewer identity is commonly available in survey data [11].
Early selection models required bivariate normally distributed errors for selection and outcome, but this requirement has now been relaxed by specifying a copula function that describes the dependence between the error terms in the outcome and selection models [12]. This development has enhanced the flexibility of selection models and may even increase their utility as a means to identify the form of nonparticipation biases in a given population, yielding insights for HIV prevention [13]. Such methodological advancements provide support for the routine adoption of selection models in epidemiological research. However, despite extensive use in economics, selection models have been used sparingly in epidemiology and typically using earlier formulations with less flexible modelling assumptions [11,[14][15][16][17].
Selection models are particularly relevant for Populationbased HIV Impact Assessments (PHIAs); these are nationally representative cross-sectional household-based surveys aimed to capture the state of the HIV epidemic in the mostaffected countries, such as sub-Saharan African countries and Haiti [18]. PHIA surveys include HIV testing via venous blood draw and other biomarker testing. PHIA response rates were generally higher than other comparable nationally representative surveys [2]. Nonetheless, blood test refusal rates in some countries have been as high as 30% for men and 19% for women [19][20][21][22][23][24][25]. While published PHIA estimates are weighted to account for non-response and the complex sampling design, these estimates relied on observed covariates and thus may not account for non-response associated with HIV status or perceptions around HIV status.
In this study, we aimed to test whether HIV prevalence estimates in seven PHIA surveys conducted in Tanzania, Uganda, Malawi, Zambia, Zimbabwe, Lesotho and Eswatini were sensitive to the method chosen to handle missing data. We compared sex-stratified HIV prevalence estimates in each country, using naïve, IPW, MI and sample selection models, to evaluate the potential benefit of accounting for MNAR data in such estimates.

Data source and sampling
Each PHIA survey was designed to characterize HIV prevalence and incidence, as well as progress towards The Joint United Nations Programme on HIV/AIDS (UNAIDS) 90-90-90 targets, referring to 90% of people living with HIV knowing their HIV status, 90% of those who know their status receiving antiretroviral therapy (ART) and 90% of those on ART achieving viral suppression [26]. Similar to DHS, the sampling approach used in PHIA begins with a first-stage sampling of census enumeration areas (EAs) with probability proportional to size within geographic strata defined by the highest subnational geographic designated area in each country, such as province or region. Within sampled EAs, households were randomly selected to participate in a interview to provide household information and a household membership roster. All rostered adults meeting country-specific age criteria (typically between 15 and 64 years of age) who had slept in the household the night before the survey were eligible to participate in an individual interview. Consenting participants were offered venous blood sample collection to conduct HIV rapid testing. Consent for the interview was obtained prior to the interview, and consent for the blood test was obtained after the interview immediately prior to testing, except in Eswatini where both consents were elicited prior to the interview. HIV test results were returned to respondents at the time of the survey, and those testing seropositive and a small sample of those testing seronegative were further provided point-of-care CD4 testing, referral to HIV care and treatment, and viral load testing, the results of which were returned to nearby health facilities within 6-12 weeks [18-25].

Measures
We defined HIV test participation as the completion of blood testing based on each country's national HIV testing algorithm; this involved sequential testing that utilized up to three HIV rapid tests administered at the home. HIV seropositivity was defined as blood test-confirmed HIV-positive status using a confirmatory HIV test; negative and indeterminate test results were classified as HIV seronegative. We selected potential covariates for inclusion in MI and selection models on the basis of hypothesized association with HIV test participation or HIV status from the existing literature. Selected covariates included: urban residential location (urban vs. rural); 5-year age-group (15-19, 20-24, 25-29, 30-34, 35-39, 40-44 or 45-49); marital status (never married, married or living together, divorced or separated, or widowed); educational attainment (no education, primary, secondary, more than secondary); ethnicity (categorical, when available); household wealth quintile within the country (based on household assets and characteristics that varied across countries [18,27]); circumcised versus uncircumcised (males only); pregnant versus not pregnant (females only); self-reported ever versus never had sex; self-reported ever versus never received HIV testing prior to the survey; province/region; and language. Interviewers' identities were recorded using unique anonymized IDs. Two different interviewers may have administered the interview and blood draw depending on training in phlebotomy and HIV testing. If the interview was administered by an interviewer without clinical training, consent for blood draw was obtained by a separate interviewer [18]. Among individuals who participated in the individual interview and were offered a blood test, we used the identity of the interviewer who elicited consent for blood testing. Among participants who did not participate in the interview, we used the identity of the interviewer who elicited consent for the interview.

Analysis
All rostered de facto household members in participating households aged 15-49 years were potentially eligible for analysis. Sample characteristics were described among the entire sample and compared to those who completed an individual interview and the further subset who completed a blood test. Blood test participation rates were calculated overall and by country, sex and age. We estimated sex-stratified HIV prevalence and 95% confidence intervals (CIs). All prevalence estimates were calculated using survey design weights to adjust for EA and household selection probabilities and household non-participation. This weighting approach allowed at minimum for all rostered de facto household members to represent the national population. Additional adjustment for interview and blood test non-participation was performed under models with different missing data approaches as described below. All analyses were conducted using the MICE and GJRM packages in R v3.6.2 [28,29].

Naïve model and maximal bounds
In naïve models, we calculated HIV prevalence as the weighted proportion of blood test participants who tested seropositive, with no further adjustment for individual-level non-participation. We also calculated maximal bounds, which represent lower and upper limits in the observed sample assuming all non-participants were HIV seronegative or HIV seropositive, respectively [30].

IPW model
IPW model estimates were additionally weighted to adjust for non-participation in the individual interview and blood testing. Survey weights were constructed using the least absolute shrinkage and selection operation (LASSO) regression and chisquared automatic interaction detector (CHAID) tree classification algorithms to select a minimal set of available variables that best predict non-response. The inverse probability of participation within strata defined by the identified set of variables was used to weight blood test respondents. This IPW approach was similar to the estimation method used in published PHIA reports [18,31], except that we omitted poststratification weighting and used Taylor series variance estimation instead of jackknife repeated replication in order to simplify comparison between models.

MI model
MI was performed using chained equations. Selected covariates were included in country-and sex-specific imputation models to generate 20 imputation datasets each. HIV prevalence was estimated by running models across datasets and pooling results as per Rubin's rules [32]. Interview and blood test non-participation weights were not used as individuallevel non-response was accounted for by MI.

Sample selection model
Selection models require specification of two equations, a selection equation for test participation and an outcome equation for HIV status [11,12]. We used a probit regression model for binary HIV status y i for individual i: i is an unobserved latent variable determining the likelihood of having HIV-positive status that depends on a vector of observed characteristics X i and random error i Similarly, we used a probit model for selection s i for individual i: i is an unobserved latent variable determining the likelihood of participation that depends on a vector of observed characteristics x i , a potentially null vector of selection variables Z i and random error u i . The predictors in X i were the same in both models, though this is not theoretically necessary. However, due to HIV-related stigma, risk factors for HIV are likely to be predictors of non-participation as well. Detailed information on the selected covariates in all models is provided in Table S1. Selection variables Z i are chosen on the basis of their predictive utility for participation but not the outcome, and are included in order to prevent collinearity between selection and outcome models [11]. Replicating prior studies, we used interviewer identity reflecting the fact that interviewer skill and personality may influence participation in HIV testing but could not plausibly affect HIV status [33], modelled as a random effect using a ridge penalty to improve convergence properties [12]. Province/region was modelled as a random effect using a Markov random field smoother [12].
We incorporated a copula function to model various forms of dependence between selection and outcome model errors [34]. We considered 19 different copulae: normal; Note: Numbers of men and women in the pooled sample, by country, urban/rural and 5-year age groups. Percentages of total sample represent the proportion of the total number of eligible household members in the total pooled sample (column percent). Percent of eligible represents the row percent of eligible household members who participated in the interview and blood test, respectively. a While most countries had two categories for urban and rural, in Lesotho, peri-urban was treated as urban for analysis.
Student's-t; Frank; Ali-Miqhail-Haq; Pluckett; Farlie-Gumbel-Morgenstern; Hougaard; Joe; Gumbel; and Clayton-the last three each rotated by 0, 90, 180 and 270 degrees. We fit sample selection models using each copula function and selected the best-fitting model using Akaike's information criterion (AIC). We explored how HIV prevalence estimates varied across copulae to assess the sensitivity of results to this model specification.

Sensitivity analyses
We conducted sensitivity analyses running all models only among individual interview participants. Compared to analyses among all eligible household members, these analyses were all additionally weighted for interview non-response at minimum. Consequently, comparisons between models accounted only for differences in missingness assumptions with respect to blood test non-participation.

Ethical approval
The PHIA protocol and data collection tools were approved by institutional review boards at Columbia University Medi-  cal Center, the United States Centers for Disease Control and Prevention and ethical review boards in each country.  Figure 1 and Table 1). Sample characteristics by country are reported in Tables S2a-2g. Consent rates varied widely by interviewer, ranging between 60% and 100%. Interviewer-level consent rates were significantly positively associated with HIV prevalence in Eswatini, Zambia and Zimbabwe, but not in Malawi, Lesotho, Tanzania or Uganda (Table S3 and Figure S1). Figure 2 and Table 2 Figure 3 shows an example of selection model copula testing in Tanzania. For men, HIV prevalence estimates from selection models under all copulae were higher than the IPW model estimate of 3.8% but two groups of estimates emerged. One group yielded similar HIV prevalence estimates to the IPW model of around 4-5%, while the second group yielded HIV prevalence estimates between 6% and 8% with CIs that did not overlap that of the IPW estimate and had substantially better AIC-based model fit. For women, we observed less variation in AIC and HIV prevalence, with all model CIs overlapping with that of the IPW model. For both sexes, the Student's-t copula had the best fit and was thus used for the main analysis. Copula tests for all countries are shown in Figure S2.

HIV prevalence estimates
Prevalence estimates for men were higher in selection models than the IPW model in Tanzania (7.9% vs. 3.8%), Uganda (4.7% vs. 4.2%), Zimbabwe (13.5% vs. 12.0%), Lesotho  (24.3% vs. 19.3%) and Eswatini (20.8% vs. 19.9%); and for women in Tanzania (7.9% vs. 6.2%), Zambia (15.5% vs. 14.3%), Zimbabwe (18.1% vs. 15.9%), Lesotho (34.9% vs. 29.7%) and Eswatini (37.5% vs. 34.3%). The differential between IPW and selection model point estimates ranged between -0.8% and 5.2%. Compared with IPW models, selection models generally yielded lower estimated female-to-male prevalence ratios; the largest change was observed in Tanzania where the female-to-male prevalence ratio disappeared from 1.63 to 1.00 (Table 2). Sensitivity analyses restricted to participants who completed an individual interview are shown in Table S4 and Figure S3. The direction and magnitude of selection bias estimated by the selection model were inconsistent with analyses using all eligible household members. Notably, however, the selection model estimates were typically higher than the IPW estimate in these analyses as well, except for Malawi and Zambia for men and Tanzania for women.

D I S C U S S I O N
In this study, we assessed the sensitivity of national HIV prevalence estimates to the assumptions made about test non-participation using data from seven population-based HIV surveys in sub-Saharan Africa. Compared to unadjusted estimates, standard methods to account for non-response under MAR assumptions (IPW and MI) produced only small changes in estimated prevalence in most countries. However, using selection models-a method that relaxes the assumption that non-participation was MAR conditional on observed characteristics-led to increased prevalence estimates compared to IPW models in several cases where MI did not. Our study replicates and extends a previous analysis using older DHS data from 12 countries, five of which were also included here (Malawi, Zambia, Eswatini, Zimbabwe and Lesotho) [16]. The impact of using selection models in our analysis was less extreme than in the earlier work, but we found that selection model estimates deviated from IPW and MI models in a greater proportion of surveys than Hogan et al. did [16]. This may reflect changes over time in the strength of association between HIV status and willingness to participate, perhaps because of reductions in HIV-related stigma due to increased availability of treatment [35]. Alternatively, the more modest differences may reflect methodological improvements, including the ability to model different forms of the relationship between selection and outcome via copula functions, and improved convergence properties via the incorporation of ridge penalties in model estimation [12,34].
Selection models potentially provide more accurate measures of an outcome when the assumptions of commonly used approaches are inappropriate. When study non-participation is driven by factors not observable to implementers, IPW or MI methods operate under incorrect assumptions regarding the mechanisms that determine missingness. IPW models can also underestimate uncertainty due to missing data in variance estimation. The selection model CIs were often, but not always, wider than those of naïve and IPW models, reflecting uncertainty in both sampling variability and regression parameters, similar to results seen in prior studies [16]. Some selection model estimates did not have wider CIs, though, possibly due to better fit achieved using the specified copula function and increased efficiency via joint regression modelling procedures [12,16]. However, even in cases where the selection model estimates were less precise, results were nonetheless statistically significantly and meaningfully different to IPW estimates in several cases. In all such cases here, when multiple copulae diverged substantially from the IPW estimate, the estimates from these copulae models were similar to one another. However, such accord may not always be the case and subject matter knowledge should be used to guide whether the selection model estimates are plausible.
The wide variability seen in estimates across different copulae suggests that selection models are in fact sensitive to this model specification. Indeed, copula selection proved to be non-trivial in several surveys where multiple copulae yielded divergent HIV prevalence estimates. As the example of Tanzania illustrated above, findings suggested that a negatively correlated copulae (Student's-t) best represented the relationship between selection and outcome, meaning that HIV-positive status was associated with lower likelihood of participation. Furthermore, Student's-t is a symmetric copula, indicating that the association was driven equally by lower participation among HIV-positive and higher participation among HIV-negative individuals. In contrast, teardrop-shaped copulae suggest that the association is stronger for certain individuals; for example, the Joe 90 rotation copula selected in Lesotho shows that the negative association between HIV status and participation is driven more by individuals who are HIV negative ( Figure S4).
We cannot rule out misspecification of the selection model equations. Our selection model results for men in Tanzania yielded a two-fold increase in HIV prevalence compared to IPW estimates despite high participation rates (84%). This suggests a degree of selection bias where approximately one-third of men who refused participation were HIV positive (compared to 3.8% among participants). The estimated correlation between selection and outcome model errors was strong (Kendall's tau: -0.548, Figure S4); however, validation requires external knowledge of study design and data collection. Similarly, MI estimates for men in Malawi and Zambia drastically diverged from the IPW models. These findings were not replicated among individual interview participants only ( Figure S3), suggesting that the selection biases likely stemmed from household non-participation. Male heads of household may disproportionately refuse when HIV positive, but this hypothesis is untestable with the available data. Future surveys should collect more robust household-level predictors of participation and HIV risk.
The validity of our selection model findings rests centrally on whether interviewer identity is a valid selection variable (i.e. it is unrelated to HIV status conditional on observed variables). This assumption is conceptually untestable. Interviewers in PHIA were intentionally matched to geographic areas based on language. However, other unplanned matching may have occurred, for example if high-performing interviewers were assigned to difficult areas, or to participants by sex or age. If unplanned matching occurred and was strongly correlated with HIV status, the exclusion criterion would be violated, although its impact on our findings cannot be discerned. Future studies could limit such violations by random-izing interviewer assignment a priori. Randomized assignment of interviewers within defined geographic areas has previously been shown to be feasible and improve identification of interviewer effects without substantial detriment to overall survey quality [36]. We are also unable to determine whether the elicitation of blood test consent prior to the interview in Eswatini (whereas it was elicited after the interview in all other countries) affected likelihood of consent and thus the selection model results. It is, for example, possible that interviewers develop rapport during the interview, increasing the likelihood of blood test participation. However, we found no stronger evidence of selection bias in Eswatini than in the other countries. Lastly, this analysis cannot account for unobserved predictors of household non-participation.

C O N C L U S I O N S
In this study, we demonstrate the utility of selection models as a valuable part of triangulation in estimating HIV prevalence [37]. While selection models require additional assumptions and careful assessment of model sensitivity, they can evaluate the potential impact of non-participation biases compared to other analytical methods where non-response is high, missingness is likely non-random conditional on observed characteristics and a plausible selection variable is available. In the context of population-based studies, this includes biomarker and other objectively verifiable outcomes, particularly those subject to stigma or fear-based perceptions (e.g. viral load for measuring treatment adherence). Selection models also enable exploration of the relationships between study participation and outcome, which is itself of scientific interest. Given the accessibility of flexible tools in readily available software, selection models can be easily implemented and have the potential to improve modelling and inference in survey analysis. Furthermore, selection model results may highlight areas for improving future surveys through enhanced interviewer training and supervision and targeted recruitment of sub-populations with known non-participation biases observed in previous surveys.

C O M P E T I N G I N T E R E S T S
The authors declare that they have no competing interests.

S U P P O R T I N G I N F O R M AT I O N
Additional information may be found under the Supporting Information tab for this article: Figure S1: Consent rates and HIV prevalence by interviewer Figure S2: Selection model results by copula Figure S3: HIV prevalence estimates among adults aged 15-49 under different missingness assumptions using data from individual interview participants only Figure S4: Copula contour plots Table S1: Model specifications Table S2a: Sample characteristics for Tanzania  Table S2b: Sample characteristics for Uganda Table S2c: Sample characteristics for Malawi Table S2d: Sample characteristics for Zambia Table S2e: Sample characteristics for Zimbabwe Table S2f: Sample characteristics for Lesotho Table S2g: Sample characteristics for Eswatini Table S3: Interviewer-level participation rate (%) and HIV test prevalence (%)